Reducing Recovery Time in a Small Recursively Restartable System
Abstract
We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over 2 years. We develop three techniques for transforming component group boundaries such that time-to-recover is reduced, hence increasing system availability. We also further RR by defining the notions of an oracle, restart group and restart policy, while showing how to reason about system properties in terms of restart groups. From our experience with applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to RR.
Similar resources
Designing for High Availability and Measurability
We propose a structuring model, called recursive restartability, aimed at controlling the amount of end-to-end unavailability and improving the measurability of software infrastructures with high availability requirements. Recursive restartability exploits the benefits of restarts at various levels within complex software systems and relies on an execution infrastructure to monitor, cure, and re...
An Implementation of User-level Restartable Atomic Sequences on the NetBSD Operating System
This paper outlines an implementation of restartable atomic sequences on the NetBSD operating system as a mechanism for implementing atomic operations in a mutual-exclusion facility on uniprocessor systems. Kernel-level and user-level interfaces are discussed along with implementation details. Issues associated with protecting restartable atomic sequences from violation are considered. The perf...
Discrete Time Analysis of Multi-Server Queueing System with Multiple Working Vacations and Reneging of Customers
This paper analyzes a discrete-time $Geo/Geo/c$ queueing system with multiple working vacations and reneging, in which customers arrive according to a geometric process. As soon as the system becomes empty, the servers all go on a working vacation together. The service times during the regular busy period, the service times during the working vacation period, and the vacation times are assumed to be geometrically distributed. Customer...
Fast Mutual Exclusion for Uniprocessors
In this paper we describe restartable atomic sequences, an optimistic mechanism for implementing simple atomic operations (such as Test-And-Set) on a uniprocessor. A thread that is suspended within a restartable atomic sequence is resumed by the operating system at the beginning of the sequence, rather than at the point of suspension. This guarantees that the thread eventually executes the sequ...
Evaluating the Recovery Process of Renal Ischemia/Reperfusion Injury in Rats Using Small-Animal SPECT
Background: Renal injuries associated with ischemia/reperfusion are a prevalent clinical phenomenon that can cause the emergence of progressive kidney diseases, eventually leading to chronic kidney injuries. The present study was conducted to evaluate the results obtained from non-invasive imaging using small-animal SPECT and investigate the recovery process in an animal model of renal ischemia...